This document describes the analysis of Cholesteatoma DNA-seq exome samples. The analysis includes 9 families, each with a minimum of 3 samples, and every sampled individual has Cholesteatoma. The project aimed to identify novel DNA mutations that could contribute to the inheritance and aetiology of Cholesteatoma.
FASTQ files were first trimmed for TruSeq adapter sequences (AATGATACGGCGACCACCGAGATCTACAACACGACGCTCTTCCGATCT, GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG) and then assessed using FastQC (Figure 1). Samples were then aligned to the GRCh38_hla_decoy_ebv genome using the cgpMAP docker image (https://github.com/cancerit/dockstore-cgpmap), which utilises bwa-mem.
BAM files were sorted using SAMtools and then analysed using Picard tools for QC. QC analysis included the removal of duplicate reads erroneously generated through PCR and sequencing (optical duplicates). Capture of the exome targets was assessed using the Picard hybrid-selection metrics (HsMetrics). A summary of the tools used is shown below:
Figure 1 - pipeline steps for variant calling
Both single base substitutions and deletions were analysed using two tools - Freebayes and GATK HaplotypeCaller. Freebayes was run individually on each sample and the calls were then filtered using the recommended parameters (https://github.com/ekg/freebayes), applying QUAL > 20 with VCFtools. The GATK best practices workflow for germline short variant discovery (SNPs + Indels) was used to identify variants (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-).
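The QUAL > 20 filter applied to the Freebayes calls can be illustrated with a minimal Python sketch. This is not the VCFtools command used in the analysis; it only shows the logic of the threshold on plain VCF text lines. The function name `filter_by_qual` and the example records are hypothetical.

```python
def filter_by_qual(vcf_lines, min_qual=20.0):
    """Keep header lines and records whose QUAL is strictly above min_qual.

    Records with a missing QUAL (".") are dropped, which is an assumption
    of this sketch rather than something stated in the report.
    """
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header and column lines are passed through unchanged.
            kept.append(line)
            continue
        qual = line.split("\t")[5]  # QUAL is the 6th VCF column
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept


# Hypothetical records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t35.2\t.\t.",
    "chr1\t2000\t.\tC\tT\t12.0\t.\t.",
    "chr2\t3000\t.\tG\tGA\t20.0\t.\t.",
]
filtered = filter_by_qual(example_vcf)
```

Because the comparison is strict, a record with QUAL exactly 20 is excluded, matching the QUAL > 20 criterion stated above.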
Variants from Freebayes and GATK were overlapped to retain only the sites called by both tools. Variant files were then filtered on allele frequency using the gnomAD variant resource (v2 GRCh38 liftover - https://gnomad.broadinstitute.org). Two allele frequency fields were used: the total allele frequency (AF) and the allele frequency in the non-Finnish European population (AF_nfe), requiring AF < 0.01 for both. This threshold was chosen to include only rare variants. Annotation of variants was performed using the Ensembl VEP tool.
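The consensus and rare-variant steps can be sketched in Python as follows. This is an illustrative sketch, not the code used in the analysis. It assumes variants are matched on chromosome, position, reference allele and alternate allele, and that a variant absent from gnomAD is treated as rare. Both are assumptions, and all function names and records are hypothetical.

```python
def variant_key(record):
    """Identify a variant by (chrom, pos, ref, alt); an assumed matching rule."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def consensus_variants(freebayes_calls, gatk_calls):
    """Return the Freebayes records whose site was also called by GATK."""
    gatk_keys = {variant_key(r) for r in gatk_calls}
    return [r for r in freebayes_calls if variant_key(r) in gatk_keys]


def is_rare(key, gnomad_af, max_af=0.01):
    """True when both AF and AF_nfe are below max_af.

    Assumption: a variant with no gnomAD entry is treated as rare.
    """
    freqs = gnomad_af.get(key)
    if freqs is None:
        return True
    return freqs["AF"] < max_af and freqs["AF_nfe"] < max_af


# Hypothetical calls and frequencies, for illustration only.
freebayes_calls = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 2000, "ref": "C", "alt": "T"},
    {"chrom": "chr2", "pos": 3000, "ref": "G", "alt": "GA"},
    {"chrom": "chr4", "pos": 5000, "ref": "T", "alt": "A"},
]
gatk_calls = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 3000, "ref": "G", "alt": "GA"},
    {"chrom": "chr3", "pos": 4000, "ref": "T", "alt": "C"},
    {"chrom": "chr4", "pos": 5000, "ref": "T", "alt": "A"},
]
gnomad_af = {
    ("chr1", 1000, "A", "G"): {"AF": 0.0004, "AF_nfe": 0.0007},
    ("chr2", 3000, "G", "GA"): {"AF": 0.002, "AF_nfe": 0.03},
}

consensus = consensus_variants(freebayes_calls, gatk_calls)
rare = [r for r in consensus if is_rare(variant_key(r), gnomad_af)]
```

In this toy data the chr2 variant passes the total AF threshold but is removed because its AF_nfe is above 0.01, which is why both fields are checked.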
The overlap between the two variant callers is shown below. Variants that were identified by both tools were used for downstream analysis in this report.
In order to identify variants that are most likely to be damaging, the following filters were applied to the dataset.
The SNVs and indels were filtered for the following:
NOT CONTAINING:
CONTAINING:
This analysis used the filtered list of variants predicted to be deleterious to the function of the protein. We next overlapped the variants from samples within the same family to identify deleterious variants that are shared by affected family members and thus may contribute to the Cholesteatoma condition.
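The within-family overlap can be sketched as a set intersection. This minimal sketch assumes that "shared" means present in every sequenced member of the family, and it uses hypothetical sample names and variant identifiers.

```python
def shared_in_family(sample_variants):
    """Return the variants present in every sample of one family.

    sample_variants maps a sample name to a set of variant keys.
    Requiring presence in all samples is an assumption of this sketch.
    """
    variant_sets = [set(v) for v in sample_variants.values()]
    if not variant_sets:
        return set()
    return set.intersection(*variant_sets)


# Hypothetical family of three affected samples, for illustration only.
family_1 = {
    "sample_1": {("chr1", 1000, "A", "G"), ("chr4", 5000, "T", "A")},
    "sample_2": {("chr1", 1000, "A", "G"), ("chr7", 800, "C", "T")},
    "sample_3": {("chr1", 1000, "A", "G"), ("chr4", 5000, "T", "A")},
}
shared = shared_in_family(family_1)
```

A less strict rule, such as presence in at least two members, would need a count per variant instead of a plain intersection.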
A list of shared variants from each family is shown below:
Using the variants that are shared within families, we then overlapped the affected genes across families. The table below therefore lists genes with predicted deleterious variants across Cholesteatoma families that may contribute to the disease.
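The gene-level overlap across families can be sketched by counting, for each gene, the number of families in which it carries a shared variant. This is a sketch only. The report does not state a minimum number of families, so the `min_families` parameter is an assumption, and the family and gene names are placeholders rather than results.

```python
from collections import Counter


def genes_by_family_count(family_genes):
    """Count the number of families in which each gene is affected."""
    counts = Counter()
    for genes in family_genes.values():
        # set() ensures a gene is counted once per family.
        counts.update(set(genes))
    return counts


def recurrent_genes(family_genes, min_families=2):
    """Return genes affected in at least min_families families (assumed cutoff)."""
    counts = genes_by_family_count(family_genes)
    return sorted(g for g, n in counts.items() if n >= min_families)


# Placeholder gene names, for illustration only (not findings of the analysis).
family_genes = {
    "family_1": ["GENE_A", "GENE_B", "GENE_A"],
    "family_2": ["GENE_A", "GENE_C"],
    "family_3": ["GENE_B", "GENE_A"],
}
recurrent = recurrent_genes(family_genes)
in_all_families = recurrent_genes(family_genes, min_families=len(family_genes))
```

Setting `min_families` to the number of families restricts the list to genes affected in every family, while lower values rank genes by how often they recur.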
In summary, this analysis identified a set of consensus variants called by two independent variant callers - GATK HaplotypeCaller and Freebayes. We have identified a number of genes with predicted damaging variants that are shared among affected members of Cholesteatoma families and that also recur at the gene level across families. These variants are candidates for further experimentation.